RFC: kernel compile separation + mixed-TP + MX client adoption (post-#2389) by KavinKrishnan · Pull Request #2652 · PrimeIntellect-ai/prime-rl

KavinKrishnan · 2026-05-27T15:19:51Z

Summary

This is a doc-only RFC layered on top of #2389. It proposes the next phase of work once #2389 merges to main:

Phase 1 — six surgical fixes against the merged nixl_mx code (close the bug classes we hit during GB200 bring-up: cross-subnet add_remote_agent full-mesh, stale READY peer dedup, heartbeat / STALE-on-shutdown, hardcoded 1200 s timeout, non-MLA model guard for update_mla_absorbed_weights, HSDP barrier ordering).
Phase 2 — graduate src/prime_rl/transport/mx_rendezvous.py (~185 LOC of in-tree rendezvous code) onto NVIDIA's published modelexpress Python clients (MxV2TrainingPublisher / MxV2RefitReceiver). Inherits heartbeat + freshest-per-rank dedup + retention + the v2 sidecar filter (modelexpress PR #295) for free. The in-tree NixlAgentWrapper, Slot, TransportPlan, and classic_cuda_pool stay untouched — that's prime-rl-specific data-plane specialization.
Phase 3 — fixes the trainer-side kernel-compile pinning issue surfaced during the FP8 cast-pipeline iteration on this branch. Trainer publishes HF-raw bytes (kernel-agnostic) over NIXL; inference compiles into its target layout (DeepGemm, cutlass, …) via a receiver-side scratch-buffer pass. Extends the v2 shape registry with a compile_target + compile_metadata field so receivers filter on compatibility. Heterogeneous fleets (DeepGemm and cutlass on the same training run) now work without trainer-side branching.
Phase 4 — generalizes the v2 sharding metadata to handle mixed-TP / mixed-EP via TargetTPLayout + multi-source slice discovery. Same machinery as our NemoRL v2 MoE expert filtering, generalized to dense matmul axes.
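The multi-source slice discovery in Phase 4 amounts to interval arithmetic along the sharded axis. A minimal sketch, assuming a single evenly divided sharded axis: `SliceRead` and `plan_slice_reads` are hypothetical names for illustration, not the `TargetTPLayout` or modelexpress API, whose actual signatures are defined in the RFC doc rather than here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceRead:
    """One contiguous read from a single source rank (hypothetical type)."""
    source_rank: int
    source_offset: int  # start index inside the source rank's shard
    target_offset: int  # start index inside the target rank's shard
    length: int         # elements along the sharded axis


def plan_slice_reads(dim: int, source_tp: int, target_tp: int,
                     target_rank: int) -> list[SliceRead]:
    """List the source-rank reads needed to fill one target rank's shard."""
    if dim % source_tp or dim % target_tp:
        raise ValueError("sketch assumes the sharded axis divides evenly")
    if not 0 <= target_rank < target_tp:
        raise ValueError("target_rank out of range")

    src_shard = dim // source_tp
    tgt_shard = dim // target_tp
    tgt_start = target_rank * tgt_shard
    tgt_end = tgt_start + tgt_shard

    reads = []
    first = tgt_start // src_shard       # first source rank that overlaps
    last = (tgt_end - 1) // src_shard    # last source rank that overlaps
    for rank in range(first, last + 1):
        src_start = rank * src_shard
        lo = max(tgt_start, src_start)
        hi = min(tgt_end, src_start + src_shard)
        reads.append(SliceRead(rank, lo - src_start, lo - tgt_start, hi - lo))
    return reads


# Trainer at TP=8, inference at TP=2: target rank 1 pulls from sources 4..7.
wide_to_narrow = plan_slice_reads(dim=4096, source_tp=8, target_tp=2, target_rank=1)
# Trainer at TP=2, inference at TP=8: target rank 5 pulls part of source 1.
narrow_to_wide = plan_slice_reads(dim=4096, source_tp=2, target_tp=8, target_rank=5)
```

When the trainer's TP degree is higher than the receiver's, each receiver rank reads from several sources; when it is lower, each receiver rank reads a sub-range of one source. Either way the receiver ends up with a list of (source, offset, length) reads, which is why the plan treats multi-source discovery as the general case.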

The plan pulls heavily on the NemoRL × Dynamo path (NVIDIA, @jthomson04) which is already running cross-node at 380 Gbps on GB300 RoCE for an 8.82 GB / 399-tensor refit on Qwen3-4B-Thinking-2507 — same scratch-buffer + worker_extension_cls pattern this plan adopts.
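For a sense of scale, the quoted figures imply a wire time of well under a second per refit. This is a back-of-the-envelope sketch that assumes decimal gigabytes and that 380 Gbps is sustained for the whole transfer; the source states neither, and the helper name is illustrative.

```python
def transfer_seconds(payload_gb: float, link_gbps: float) -> float:
    """Seconds to move payload_gb gigabytes at link_gbps gigabits per second."""
    return payload_gb * 8.0 / link_gbps


# 8.82 GB refit at 380 Gbps (assumed sustained, decimal units).
refit_s = transfer_seconds(8.82, 380.0)
print(f"{refit_s * 1000:.0f} ms")  # prints "186 ms"
```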

What's in this PR

Doc only: 517 lines at docs/proposals/post-pr2389-kernel-compile-plan.md, including component and per-refit sequence diagrams (mermaid). No code changes; the implementation phases are sequenced behind this RFC's acceptance.

Why a draft RFC against `nixl_mx`

The plan only makes sense in the context of this branch's code. Targeting main now would leave the RFC dangling, since main has no nixl_mx code to build on. Plan: re-target to main once #2389 merges, then land Phase 1 quickly as a follow-up PR.

Estimated impact

Phase	Net LOC
1 — surgical fixes	~100 (in-tree)
2 — client graduation	−400 (`mx_rendezvous.py` deleted) + 150 (import-and-call)
3 — compile-target registry + receiver-side compile passes	~+45 modelexpress, ~+350 prime-rl
4 — mixed-TP / mixed-EP slice discovery	~+200 across both repos

Total: ~450 LOC added for Phases 3-4, offset by the ~400 LOC that Phase 2 removes from the in-tree maintenance burden.

Test plan

N/A — doc only. Each implementation phase ships its own test plan in the doc (see §8). Phase 3 validation piggybacks on the existing NemoRL+Dynamo GB300 cluster to de-risk the compile-pass design before porting into the prime-rl worker.

Note

Low Risk
Documentation only; no production code, config, or transport behavior changes in this PR.

Overview
Adds docs/proposals/post-pr2389-kernel-compile-plan.md, a doc-only RFC (~517 lines) for work after PR #2389 lands. It does not change runtime code.

The proposal keeps the existing nixl_mx data plane (Slot, TransportPlan, NixlAgentWrapper, pools) and plans rendezvous/metadata extensions only:

Phase 1: Six targeted fixes (same-rank remote agents, freshest-per-rank dedup, heartbeat/STALE, configurable timeouts, MLA guard, HSDP barrier order).
Phase 2: Replace in-tree MxRendezvous with ModelExpress MxV2TrainingPublisher / MxV2RefitReceiver; adopt worker_extension_cls on the vLLM worker.
Phase 3: Move kernel layout compile to inference via scratch buffers + pluggable CompilePass (hf_raw, DeepGemm, cutlass); extend the v2 registry with compile_target / compile_metadata and compile_target_filter on discovery.
Phase 4: Mixed TP/EP via TargetTPLayout, slice-aware receive_weights_scratch, and multi-source discover_v2_sources_for_slice.
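The Phase 3 discovery filter can be pictured as a plain membership test over registry entries. A minimal sketch, assuming each entry carries the two fields the RFC names (`compile_target`, `compile_metadata`): the `SourceEntry` type, the `filter_sources` helper, and the lowercase target strings are hypothetical and do not reflect the modelexpress client API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceEntry:
    """Hypothetical registry record; only the two compile_* fields come from the RFC."""
    rank: int
    compile_target: str                      # e.g. "hf_raw", "deepgemm", "cutlass"
    compile_metadata: dict = field(default_factory=dict)


def filter_sources(entries: list[SourceEntry],
                   accepted_targets: set[str]) -> list[SourceEntry]:
    """Keep only the sources whose published layout this receiver can consume."""
    return [e for e in entries if e.compile_target in accepted_targets]


published = [
    SourceEntry(rank=0, compile_target="hf_raw"),
    SourceEntry(rank=1, compile_target="deepgemm", compile_metadata={"block": 128}),
    SourceEntry(rank=2, compile_target="hf_raw"),
]

# A receiver that compiles locally only needs the kernel-agnostic bytes.
usable = filter_sources(published, accepted_targets={"hf_raw"})
```

Because the trainer publishes `hf_raw` regardless of which kernels the receivers use, a DeepGemm worker and a cutlass worker could both pass `{"hf_raw"}` here and run their own compile pass afterwards, which is what removes the trainer-side branching.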

The doc includes mermaid architecture/sequence diagrams, phased LOC estimates, open questions, and links to ModelExpress/NemoRL validation paths (e.g. scratch refit on GB300).

Reviewed by Cursor Bugbot for commit 7feee0d.

…le, mixed-TP, MX clients

Proposes the next phase of work on top of `nixl_mx` once PrimeIntellect-ai#2389 merges:

1. Phase-1: six surgical fixes against the in-tree code that close the bug classes we hit during GB200 bring-up (cross-subnet add_remote_agent full-mesh; stale READY peer dedup; heartbeat / STALE-on-shutdown; hardcoded 1200s timeout; non-MLA model guard; HSDP barrier ordering). Line-pinned against HEAD `79ea824d8`.
2. Phase-2: graduate `src/prime_rl/transport/mx_rendezvous.py` onto NVIDIA's published `modelexpress` Python clients (`MxV2TrainingPublisher` / `MxV2RefitReceiver`). Deletes ~185 LOC of in-tree rendezvous that duplicates the upstream client. Inherits heartbeat + freshest-per-rank dedup + retention + sidecar-filter for free. `NixlAgentWrapper` / `Slot` / `TransportPlan` / `classic_cuda_pool` stay; those are prime-rl specialization.
3. Phase-3: solves the trainer-side kernel-compile issue surfaced during PrimeIntellect-ai#2389's FP8 cast-pipeline iteration. Trainer publishes HF-raw bytes (kernel-agnostic); inference compiles into its target layout (DeepGemm, cutlass, ...) via a receiver-side scratch-buffer pass. Extends the v2 shape registry with `compile_target` + `compile_metadata`. Heterogeneous fleets (mixed kernels on the same training run) now work without trainer-side branching.
4. Phase-3 also generalizes the v2 sharding metadata to handle mixed-TP/EP via `TargetTPLayout` + multi-source slice discovery in the same machinery NemoRL v2 uses for MoE expert filtering.

Pulls heavily on the NemoRL × Dynamo path (NVIDIA, John Thompson) which is already running at 380 Gbps on GB300 RoCE for an 8.82 GB refit; this plan adopts the same scratch-buffer + worker-extension-cls pattern.

Component + per-refit sequence diagrams (mermaid) included. Estimated ~450 LOC additive across modelexpress + prime-rl for Phases 3-4 (plus the ~400 LOC subtraction from Phase 2).

Doc only. Implementation phases sequenced behind the upstream merge of PrimeIntellect-ai#2389.

… (v0.7.x)

Captures the empirical findings from baking PRs #1 and #2 into an ARM64 GB200 image and running it on the kavin namespace for 8+ hours on Qwen3-30B-A3B-Instruct-2507 with gsm8k.

Documents three real surprises the unit tests didn't cover:

1. Dockerfile.cuda's `uv sync` is missing `--extra disagg`, so modelexpress isn't installed in stock images; inference workers crash at the first import. Shipped v0.7.1 as a one-line overlay that adds the extra until the upstream Dockerfile.cuda can be updated.
2. `LD_PRELOAD` path for libcudart.so.12: v0.5.2 had /usr/local/cuda present in the final stage; v0.7.0 (built from upstream Dockerfile.cuda as-is) doesn't. The pip-installed wheel path (/app/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib/) is the new canonical location.
3. The configmap monkeypatch (patch_nixl_mx.py) and Phase 2's source-baked fixes are complementary: they patch different layers (broadcast vs rendezvous-wait) and both should stay until PR #1 merges upstream.

Build experience numbers:

- v0.7.0 from-scratch ARM64 build under QEMU: 6h45min (uv sync 45m, flash-attn from source 3h45m).
- v0.7.1 overlay on top of v0.7.0: ~3 min.

Cluster observations from v0.5.2 + configmap monkeypatch (the runtime-patched path our PR #1 codifies into source):

- 183 successful RL refit cycles in one 66-min uninterrupted window
- Reward variance 0.5-1.0 across orchestrator steps (real learning)
- Off-policy level = 0 throughout
- Zero NIXL data-plane errors
- Recurring orchestrator wait_for_all_peers_ready timeout (~once per 30-66 min) is the exact bug class Phase 2's rendezvous-level dedup eliminates

Also notes seven RFC updates queued in pensieve/RL/PrimeRL/09_rfc_updates_needed.md, three of which are new from this build experience (disagg extra, LD_PRELOAD path, vLLM PR #43375 / Anyscale RDT positioning).

Companion to the RFC at docs/proposals/post-pr2389-kernel-compile-plan.md.

…/3/4 upstream form

vLLM published https://vllm.ai/blog/2026-05-28-native-rl-apis the same day, announcing a standardized WeightTransferEngine abstract base + 4-phase lifecycle (init / start / update / finish) + a pluggable WeightTransferEngineFactory.register_engine(...) extension point.

This is the upstream integration seam that the in-tree MxRendezvous reimplementation in PR PrimeIntellect-ai#2389 and the worker_extension_cls injection in inference/vllm/worker/nixl_mx.py have been emulating. The cleanest form of all our Phase 2/3/4 work upstream is a single MxWeightTransferEngine adapter (~150-200 LOC) that subclasses WeightTransferEngine and wraps the existing MxV2RefitReceiver + MxV2TrainingPublisher.

Three immediate consequences captured in §8:

- §8.1: Phase 2/3/4 should be repackaged as MxWeightTransferEngine for upstream contribution; the existing patches stay correct, the packaging just becomes upstream-native.
- §8.2: The blog credits Matej Sirovatka specifically. He's likely mid-flight on a native-APIs rewrite of prime-rl's nixl_mx broadcast. Ask him before pushing Phase 2 upstream; the work may retarget to the adapter path directly.
- §8.3: Their validation was at 16x 8xH200, DPEP32, 256 GPUs total. That scale makes Phase 4's multi-source slice planning load-bearing (mixed-TP/EP is the common case), not optional. Validates the design direction and sets the next cluster validation target after the DP=4 kavin smoke.
- §8.4: pause_generation(mode="keep") + two-phase DPEP pause are features we don't yet match. Keep mode unlocks true async RL; queue after Phase 2 lands.

Updated follow-up list grows from 4 to 7 items, with the three new ones being: write MxWeightTransferEngine, adopt keep-mode pause in the orchestrator, and coordinate with Robert Shaw / the vLLM RL roadmap on the K8s-native weight transfer engine they mention as ongoing work (which describes MX itself, modulo who's driving the upstream PR).

…three docs

The three proposal docs now form a coherent set:

- post-pr2389-status-and-plan.md: executive summary; failure-class to fix mapping; mermaid diagram of the data + metadata planes; Phase 0 unblock guidance
- post-pr2389-kernel-compile-plan.md: full RFC with phase-by-phase design rationale (unchanged except for cross-link header)
- build-notes-2026-05-28.md: operational findings from the source-baked image build, plus the vLLM native RL APIs reframe in section 8

Each doc now has a header block linking to the other two so readers can navigate based on intent (status vs design vs operational). The status-and-plan doc is the natural entry point for someone coming to the work cold; the RFC and build-notes are the deep dives.

KavinKrishnan added 2 commits May 27, 2026 08:06

docs(proposals): scrub stray internal-pensieve reference

7feee0d

KavinKrishnan marked this pull request as draft May 27, 2026 15:20

KavinKrishnan added 3 commits May 28, 2026 20:06

KavinKrishnan wants to merge 5 commits into PrimeIntellect-ai:nixl_mx from KavinKrishnan:kavink/post-2389-kernel-compile-plan
